PredictiveNetwork: predictive gene network estimation with application to gastric cancer drug response-predictive network analysis

Background Gene regulatory networks have garnered a large amount of attention to understand disease mechanisms caused by complex molecular network interactions. These networks have been applied to predict specific clinical characteristics, e.g., cancer, pathogenicity, and anti-cancer drug sensitivity. However, in most previous studies using network-based prediction, the gene networks were estimated first, and predicted clinical characteristics based on pre-estimated networks. Thus, the estimated networks cannot describe clinical characteristic-specific gene regulatory systems. Furthermore, existing computational methods were developed from algorithmic and mathematics viewpoints, without considering network biology. Results To effectively predict clinical characteristics and estimate gene networks that provide critical insights into understanding the biological mechanisms involved in a clinical characteristic, we propose a novel strategy for predictive gene network estimation. The proposed strategy simultaneously performs gene network estimation and prediction of the clinical characteristic. In this strategy, the gene network is estimated with minimal network estimation and prediction errors. We incorporate network biology by assuming that neighboring genes in a network have similar biological functions, while hub genes play key roles in biological processes. Thus, the proposed method provides interpretable prediction results and enables us to uncover biologically reliable marker identification. Monte Carlo simulations shows the effectiveness of our method for feature selection in gene estimation and prediction with excellent prediction accuracy. We applied the proposed strategy to construct gastric cancer drug-responsive networks. Conclusion We identified gastric drug response predictive markers and drug sensitivity/resistance-specific markers, AKR1B10, AKR1C3, ANXA10, and ZNF165, based on GDSC data analysis. Our results for identifying drug sensitive and resistant specific molecular interplay are strongly supported by previous studies. We expect that the proposed strategy will be a useful tool for uncovering crucial molecular interactions involved a specific biological mechanism, such as cancer progression or acquired drug resistance. Supplementary Information The online version contains supplementary material available at 10.1186/s12859-022-04871-z.


Background
Gene networks are crucial for understanding complex disease mechanisms because the molecular mechanisms involved in disease are related to abnormalities in complex molecular networks, rather than in a single gene. To estimate gene regulatory networks, various computational strategies have been proposed and used to identify the molecular interplay involved in specific biological processes (e.g., cancer progression or anti-cancer drug sensitivity). Estimated gene networks have been applied to uncover complex disease mechanisms and identify drug-response marker. Importantly, the effectiveness of network-based analysis has been validated [1,2]. The gene regulatory network has been also applied to predict a specific clinical characteristic, e.g., cancer prediction, pathogenicity prediction, where the network is used as an input of prediction model [3][4][5]. In recent years, predicting drug sensitivity and understanding the molecular mechanisms related to drug resistance in cancer cells has drawn a large amount of attention. Several studies predicted drug sensitivity based on gene networks, e.g., protein-protein interaction (PPI) networks or prior knowledge networks [6][7][8].
Although several computational strategies have been developed for network-based prediction, existing studies estimate gene networks first, and predict a specific clinical characteristic based on pre-estimated networks. Thus, we cannot identify the gene regulatory system characteristics that are related to a specific clinical characteristic (i.e., the object of prediction). Furthermore, existing studies focus only on algorithmic and statistical performance and develop prediction strategies purely from mathematical and computational viewpoints without considering network biology. This leads to difficulty when interpreting the prediction results and when identifying biomarkers.
To address this issue, we propose a novel statistical strategy called a PredictiveNetwork. We consider the response variable-predictive gene network estimation, where the response variable is a specific clinical characteristic. PredictiveNetwork performs lossof-function analysis for gene network estimation and prediction. The objective function of PredictiveNetwork consists of loss functions for gene network estimation and prediction, and thus the gene network estimation and prediction are simultaneously performed. In our strategy, the gene network is estimated by minimizing the losses of estimating gene regulatory systems and response variable prediction, and thus we can uncover prediction specific gene regulatory system, i.e., a prediction-specific gene regulatory network for a clinical characteristic. Furthermore, we incorporate knowledge of network biology into the statistical prediction model based on the network constraint L 1 -type regularization, as described in [9,10]. Genes linked within networks/pathways may have similar biological functions, while hub genes, which interact with many other genes, are crucial markers that play key roles in gene regulation and biological processes [11]. In our strategy, differences in coefficients for neighboring genes in the network are smoothed, followed by simultaneous selection of the related genes. We also encourage that hub genes would have large coefficients and/or would be easily selected in the prediction model. In short, the prediction model is constructed based on crucial subnetworks consisting of hub genes and their target/regulator genes. Thus, we can perform biologically reliable interpretation of the prediction results based on molecular interplay in the subnetwork and reliably identify biomarkers (e.g., uncovering crucial genes and molecular interactions involved in a specific biological process).
We demonstrate the effectiveness of the PredictiveNetwork for prediction and feature selection accuracies in the prediction model and edge selection accuracies in gene network estimation using Monte Carlo simulations. We also applied our strategy to the Sanger Genomics of Drug Sensitivity in Cancer (GDSC) dataset from the Cancer Genome Project to construct drug response-predictive gene network. We performed drug sensitivity prediction and sensitivity-predictive gene network analysis for the FDA-approved gastric cancer drugs, doxorubicin, mitomycin-c, 5-Fluorouracil (5-FU), and docetaxel. Our strategy shows effective prediction results for mitomycin-c, 5-Fluorouracil, and docetaxel sensitivity. Then, we identified predictive drug response markers and characteristics of the associated regulatory systems in drugsensitive and -resistant cell lines. More than half of the identified gastric cancer drug response markers have strong evidences as biomarkers for gastric cancer and anticancer drug responses. In particular, the identified AKR family (AKR1C1, AKR1C3, AKR1B10) are likely anti-cancer drug resistance markers. We identified AKR1C3 and AKR1B10 as having drug resistance characteristics and that their hubness becomes significantly smaller in drug-resistant compared to drug-sensitive cell lines. For the drug sensitive cell lines, activities of ANXA10 and ZNF165 are identified as characteristics and the hubness of ANXA10 in drug-sensitive cell lines is strongly supported by previous studies. Further, ZNF165 may be a novel marker of gastric cancer drug responsiveness. Our results of GDSC data analysis suggest that the molecular interplay between genes of the AKR family, rather than single genes alone, plays key roles in acquired gastric cancer drug resistance. Thus, suppressors of AKR family genes and inducers of ANXA10/ZNF165 may reduce drug resistance of cancer cell lines.
The remainder of this paper is organized as follows: In the "Method" section, we propose a novel strategy for predictive gene network estimation and then describe the numerical solution of the PredictiveNetwork. Then, we show the results of simulation studies in the "Monte Carlo simulations" section. Finally, we describe the results of gastric cancer drug response-predictive gene network analysis. Conclusions are provided in the "Discussion" section.

Method
Suppose X = (x 1 , ..., x n ) T ∈ R n×p is an n × p data matrix describing the expression of p possible regulators that control target gene transcription y j ∈ R n , j = 1, ..., k . Consider the linear regression model, where β jl is the regression coefficient that represents the effect of each regulator x j on its target y j and ǫ j = (ǫ j1 , ..., ǫ jn ) T is a random error vector that is assumed to be independently and identically distributed with mean 0 and variance σ 2 y . Gene regulatory networks are often estimated using the following L 1 -type regularization methods (e.g., lasso, elastic net, fused lasso, etc.) [12][13][14], (1) where and 1 , 2 > 0 are the regularization parameters of β.
Although several statistical methods were developed for network-based prediction models, previous studies performed network estimation and prediction separately, i.e. the gene network was constructed first and then the estimated network was used as the prediction model input [6][7][8]. Furthermore, existing prediction models were developed purely from an algorithmic and statistical point of view, without considering network biology. Thus, the existing methods cannot provide effective interpretation of the prediction results, which causes difficulty in reliable biomarker identification.

Predictive gene regulatory network estimation
To effectively predict a specific biological mechanism and gene networks estimation that provides critical insights into understanding biological mechanisms, we propose a novel statistical method, called a PredictiveNetwork.
Suppose we have n independent observations for the response variables z = (z 1 , ..., z n ) T . We consider response variable-predictive gene network estimation and propose a novel model that enables us to simultaneously estimate gene network and predict the response variable, where θ 0 is an intercept, θ = (θ 1 , ..., θ k ) T is a coefficient vector of gene expression levels for the response variable z , B = (β 1 , ..., β k ) is a p × k matrix describing regulatory systems between genes in the directional network, and 3 > 0 is a regularization parameter to impose sparsity on θ in the prediction model. The first term in (3) indicates the loss function for the prediction of a response variable based on the gene network and the second term is the loss function of gene network estimation.
In the proposed method, the regularity system between genes B is estimated to minimize errors for both network estimation and prediction. In other words, our method can provide prediction-specific network estimation results. Thus, we can identify crucial information for understanding biological mechanisms (i.e., response variables in the prediction model) based on the estimated network.
To achieve more biologically interpretable prediction results, we incorporate network biology into the prediction model using network constraint regularization [9,10]. The estimated network from the second term in (3) can be represented by a weighted graph (2) arg min .., p} is the set of vertices corresponding to p genes and E ∈ V × V is the set of edges. (i, j) indicates a link between vertices i and j (i.e., genes i and j) and W = (w ij ), (i, j) ∈ E is the edge weight. The normalized Laplacian matrix L for the graph is given as G [9,10], where d i is the degree of each gene, which is given as d i = w ij i∼j . In our method, the effect of regulators on their target genes are estimated in B , where the rows and columns of B indicate the regulator and target gene, respectively. Thus, we compute W = w ij based of effect of the ith gene to jth gene (i.e., B ij ) and the jth gene to ith gene (i.e., B ji )as follows, We describe the estimated gene network based on the Laplacian matrix and incorporate the estimated network into the prediction model based on L . We then propose Predic-tiveNetwork as follows, where L a = S T LS with S = diag(sgn(θ 1 ), ..., sgn(θ k )) . The last term in (6) penalizes the differences of the scaled coefficients between the neighboring genes in the network. From the following local quadratic approximation of L 1 -type penalty [15], the last term in (6) can be represented as [10] (4) The genes linked in the networks may have similar biological functions. Thus, we encourage similarity in gene coefficients in the prediction model by using the network constrained penalty. The penalty term enables us to locally smooth the network and encourage the simultaneous selection of related variables, even though neighboring genes have opposite coefficient signs. The hub genes that have many interactions with other genes play a key role in gene regulation and biological processes [11], and thus are crucial markers to understand specific biological mechanisms. In our model, a relatively small penalty is imposed on the hub genes of the estimated networks by re-scaling the coefficients with the square root of the degrees. This scaling causes the hub genes and their regulator/target genes to have relatively large coefficients and/or be easily selected by penalties on the coefficients between neighboring genes. Thus, our method constructs the prediction model based on crucial sub networks, which leads to effective interpretation of the prediction results and predictive marker identification using network biology. In our model, the gene network described by B is estimated to minimize error for not only network estimation but also prediction. It implies that the network is estimated to be optimized for explain the response variable z . Thus, we can effectively uncover the biological mechanism for the response variable based on the estimated gene regulatory network. Furthermore, the coefficient θ explains the change in a response variable as changes of the effect of l th regulator gene on j th target gene (i.e., x lβjl ). That is, the change in specific-biological characteristics (e.g., drug sensitivity of cell lines) is explained by the regulatory system between genes. In short, we can interpret the complex mechanism of disease based on not a single gene but molecular interplays between genes, and it leads to biologically reliable interpretation.

Implementation
To simultaneously perform prediction and gene network estimation, we adapted the following coordinate descent algorithm for estimating B and θ [16,17].
Step 1. The optimization of β jl in (6), given θ and θ 0 has the following solution, where and S(θ, ) is a soft thresholding operator with value Step 2. We then compute W and L based on (4) and (5). The coordinate-wise update of θ j given B , L , and θ 0 has the following form, Step 3. The estimate of θ 0 given B and θ is given as Step 4. Finally, we update the parameters B ,θ,θ 0 cyclically until convergence.

Covariance updates for computational efficiency
The proposed method updates θ for p variables and β jl for p × k variables, simultaneously. This implies that the PredictiveNetwork suffers from computational complexity. Since gene expression data usually consists of a large number of features, the predictive gene network estimation requires a huge computational complexity. To reduce this computational complexity, we considered covariance updates [16].
The coordinate update of B in (9) can be rewritten as follows: where β jl is the estimate of β jl obtained from the previous update, θ jβjr x ir , and r y ij = y ij − p r=1β jr x il . The second term of (13) can be represented and the third term is given as This implies that only the last terms of (14) and (15) are updated for β jl � = 0 . Thus, we can reduce computational complexity for estimating B.
To estimate θ , part of the update in (11) can be rewritten as where θ j is the estimate of θ j obtained from previous update. We update the third term of (16) only when θ j � = 0 , which reduces the computational complexity for estimating θ.

Monte Carlo simulations
Monte Carlo simulations were conducted to investigate the performance of the proposed method. We consider simulation scenarios by benchmark of previous studies on the network-based regularization [9,10]. We simulated gene expression data under the assumed network. We supposed that each transcription factor gene (TF) regulates 10 genes and the TF expression levels are generated from a standard normal distribution.
The expression levels of each of the regulated genes ( y j , j = 1, ...10 ) of the TF ( x t ) were generated based on the expression level of t th TF as follows, where ǫ y j ∼ N (0, σ 2 y ). The response variable z is generated based on the regulatory effect of genes, i.e., based on the gene expression levels X and the effect of regulators on targets B = (β 1 , ..., β k ) as follows, where ǫ z ∼ N (0, σ 2 z ). For response variable predictive gene network estimation, we considered the following four scenarios. The scenarios 1 and 2 are adapted from the works of Li and Li [9]. In order to consider different edge size of regulator on their target genes, which is reasonable for network biology, we also perform simulation studies based on scenarios 3 and 4 in line with Sun et al. [10]. We considered σ 2 y = 0.5 and simulated 50 datasets consisting of n = 200 observations from the 4 scenarios, where the training, validation, and test datasets consisted of 80% (160), 10% (20) and 10% (20) observations, respectively.
For each scenario, we considered the number of TFs (T) as 5, 10, 25, 50, and 100 (No. TFs). We chose the optimal regularization parameter combination that minimized the following mean squared error computed from the validation dataset.
where ẑ i =θ TBT x i , V and n V are the set of indexes and sample size of validation dataset, respectively.
Our method was evaluated by comparing the prediction model based on an independently estimated network with prediction (NW.P), where the gene network was first estimated using the lasso. Then, the prediction of the response variable was based on the estimated network. We also considered prediction based on expression levels rather than networks, i.e. X was used as an input of the prediction model. For these predictions, the lasso (LA), elastic net (EL), XGBoost (XGB) and neural network (NN) were used. For neural network, we used the fully-connected feed-forward neural network with a hidden layer based on ReLU activation function. We compared the prediction results based on the prediction accuracy (mean squared error of the test sets) and the feature selection accuracy (true positive rates, true negative rates, and their average) of θ in the prediction model and of B in the network estimation. Table 1 shows the feature selection accuracies of the genes in the prediction model and the network edges, and the prediction accuracy for σ z = 0.1 . The column "Feature selection of genes" indicates the true positive rates, true negative rates, and their average for θ . The feature selection accuracies for B are given in column "Feature selection of edges". The average mean square errors for 50 datasets are given as the prediction accuracy in the column "MSE". We also show the results for σ z = 0.5 and σ z = 1 in Tables 2  and 3, respectively.
As shown in Table 1, the proposed method shows effective performance for feature selection in the prediction model (i.e. θ ). Although there were not large differences in the true negative rate, our method shows outstanding performance for the true positive rate. Other methods show poor results for true positive results. Our results demonstrate effective overall feature selection and gene selection in the prediction model (i.e., the average of the true positive and negative rates). The outstanding feature selection results in a prediction model (i.e. θ ), which can be also seen for σ z = 0.5 and σ z = 1 , as shown in Tables 2 and 3. Furthermore, the PredictiveNetwork provides effective gene network estimation, i.e. effective edge selection results (i.e. B ) were observed for all scenarios. Thus, our method provides efficient prediction accuracy overall. Although the predictive results are not very different between methods (i.e., Pro, NW.P, EL and LA), our strategy shows low prediction error in most scenarios.
We also illustrate the performances of the methods for various proportions of training dataset, i.e., 50%, 60%, 70% and 80% of n. We generate datasets from the scenarios for 30 TFs, σ = 1 and n = 300 and consider equal size of validation and test datasets. Table 4 shows the results of feature selection of genes (i.e., average of true positive and true negative for θ ) and prediction accuracy, where "ScnX" indicates the scenario X.
As shown in Table 4, the feature selection results are not significantly affected by the proportion of training datasets. On the other hand, the prediction error (i.e., MSE) is getting larger as the proportion of training datasets gets smaller, in overall. The loss of prediction accuracy as the training datasets get smaller proportion can be seen in the results of not only our method but also existing approaches. Although the proposed method suffers from the loss of accuracy in small proportion of training set, superiority of the PredictiveNetwork can also be confirmed for various proportions of training datasets. In short, the proposed method provides effective results for gene network estimation, feature selection, and accuracy of prediction models constructed by incorporating network biology.

Anti-gastric cancer drug sensitivity-predictive gene network analysis
To illustrate the proposed method, we applied our method to estimate drug sensitivitypredictive gene networks. We used a publicly available large scale pharmacogenomic data set, i.e. the "Sanger Genomics of Drug Sensitivity in Cancer (GDSC) dataset from the Cancer Genome Project". The gene expression data (Cell_line_RMA_proc_basalExp. txt) and drug sensitivity data given as the half-maximal inhibitory concentration (IC50) and the Z-score for 345 compounds (GDSC1_fitted_dose_response_25Feb20.xlsx) are obtained from the GDSC dataset (https:// www. cance rrxge ne. org/ downl oads/ bulk_ downl oad). We focused on FDA-approved drugs for stomach (gastric) cancer (https:// www. cancer. gov/ about-cancer/ treat ment/ drugs/ stoma ch). Among the 18 approved drugs, we considered four drugs: doxorubicin, Mitomycin-c, 5-Fluorouracil (5-FU), and Docetaxel that have drug sensitivity values in GDSC dataset. The expression levels of 10% of the genes (976 genes) with the highest variance in all cell lines were used for drug sensitivity-predictive network estimation. For each drug, we matched the expression levels and drug sensitivity (Z-score of IC50 value) for each cell line. The matching returned 948, 855, 891, and 948 cell lines consisting of around 30 cancer types for doxorubicin-, mitomycin-c-, 5-FU-, and docetaxel-sensitive predictive network estimation, respectively. The prediction model consisted of 80%, 10%, and 10% of cell lines for the training, validation, and test datasets. Similar to the simulation study, we evaluated the prediction accuracy of our method by comparing the prediction results based on separately estimated networks (NW.P) and gene expression based prediction by lasso (LA) and elastic net (EL). The prediction results of the gastric cancer drug sensitivities are given in Fig. 1. The proposed method shows effective results for predicting sensitivity to Mitomycin-c, 5-FU, and Docetaxel, while the elastic net showed outstanding performance for doxorubicin.
We next considered drug response predictive marker identification for gastric cancer. Genes having non-zero coefficient values ( θ ) were considered as drug response predictive markers. Markers identified for all four drugs were considered as common markers that predict responses to anti-gastric cancer drugs. Table 5 shows the common markers, where the columns " Gastric cancer drug" and "Gastric cancer" indicate evidence related to the mechanism of the anti-cancer drugs and gastric cancer, respectively.
As shown in Table 5, more than half of the identified common markers were previously identified as markers for anti-cancer drugs and gastric cancer. AKR1C1 is a wellknown marker of drug resistance. The mechanism of AKR1C1 underlying the acquired anti-cancer drug resistance has drawn a large amount of attention.
Marker for anti-cancer drugs

• AKR1C1
Activation of the Nrf2/AKR1C axis contributes to oxaliplatin resistance in TSGH-S3 cells. Manipulating Nrf2/AKR1Cs activity may be useful for managing oxaliplatin-refractory gastric cancer [18]. The AKR1C family is involved in chemotherapy resistance in stomach, colon, lung, and brain cancers [19]. Furthermore, AKR1C1 and AKR1C3 inactivate doxorubicin cytotoxicity and are involved in oxaliplatin-
Overexpression of AKR1C family was observed upon the acquisition of doxorubicin resistance. Various AKR family members are among the most differentially expressed genes upon the acquisition of doxorubicin resistance [20]. These findings suggest that AKRs may play a key role in doxorubicin resistance. Many anti-cancer drugs, e.g., doxorubicin, daunorubicin, and haloperidol, are metabolized by carbonyl-reducing enzymes including AKR1A1, AKR1B1, AKR1B10, AKR1C1, AKR1C2, and AKR1C3.
The aldo-keto reductase (AKR) superfamily is also involved in the development of drug resistance in cancer cells [39]. AKR1C1 and AKR1C3 play a key role in cisplatin resistance in SRCGC by regulating redox-dependent autophagy. Further, AKR1C1 is a crucial regulator of cisplatin-resistance in head and neck squamous cell carcinoma (HNSCC) and is a poor prognostic factor for HNSCC patient death [21]. Finally, AKR1C1 is upregulated by IL-6 and Nrf2 and promotes acquired cisplatin-resistance in metastatic ovarian and gastric cancer cells. The overexpression of AKR1B1, AKR1C1, AKR1C2, and AKR1C3 is responsible for the early appearance of doxorubicin drug resistance [22]. Sensitivity to cisplatin, cis-diamminedichloroplatinum (CDDP), and 5-FU is restored when AKR1C1, AKR1C2, AKR1C3, and AKR1C were knocked down. Inhibiting AKR1C family genes enhances sensitivity to CDDP and 5-FU [23]. AKR1C1 plays an crucial role in drug resistance in bladder cancer cells [24]. Overexpression of AKR genes in cancer cells that are resistant to chemotherapeutic agents (i.e., cisplatin, doxorubicin, duanorubciin, mitomycin, emozolomide, cyclophosphosphamide, and oracin) is common [40]. Resistance to enzalutamide is caused by AKR1C3 overexpression. • CRYAB CRYAB protein levels are significantly reduced after doxorubicin treatment [30]. PDCryab1, a peptide from CRYAB, is a candidate for protecting the myocardium against doxorubicin-induced cell apoptosis.

• PEG10
Knocking down PEG10 enhances the sensitivity of MKN7 cells (human gastric adenocarcinoma cells) to docetaxel [35]. Inhibiting PEG10 expression enhances the effect of 5-FU on apoptosis, and PEG10 is upregulated in cases treated with neoadjuvant docetaxel. [36].

Markers for gastric cancer
• AKR1C1 AKR1C1 is a potential biomarker and therapeutic target for gastric cancer [25].
• CRYAB CRYAB is a therapeutic target for gastric cancer [31]. CRYAB is a prognostic biomarker and therapeutic target in human solid tumors. The role of CRYAB in anti-cancer invasion and metastasis via epithelial-mesenchymal transition (EMT) was recently uncovered [32]. Increased CRYAB expression is associated with poor overall survival in digestive system cancer patients. CRYAB contributes to gastric cancer cell migration and invasion via EMT, which is mediated by the NF-κB signaling pathway. Tao et al. [33] demonstrated that high CRYAB levels are related to angiogenesis and poor prognosis in gastric cancer. • TMEM139 TMEM139 was identified as a differentially expressed gene in intestinal metaplasia that does not progress to gastric cancer [34].
PEG10 is a high lymph node ratio-associated gene whose expression is positively correlated with pathological stage III gastric cancer [35]. Knockdown of PEG10 suppresses proliferation, invasion, and decreases chemo-resistance in gastric cancer cells. Silencing lncRNA PEG10 inhibits the occurrence and progression of gastric cancer [38].
ZG16B is a tumorigenic factor and diagnostic marker in pancreatic, gastric, colon, ovarian, oral squamous cell, and cervical carcinoma [29]. To uncover common regulatory systems involved in gastric drug responses, we constructed a gene network based on the target and regulator genes of the common markers. We consider the regulator (target) genes of the common markers in the estimated networks for more than one drug as common regulators (target) of the identified markers. Hereafter, we refer the common genes, common regulator and target genes as identified gastric cancer drug markers. For the identified markers, we extract their networks from B doxorubicin , B mitomycin-c ,  Fig. 2 shows the gene regulator network described by B CM jl . AKR1C1, a crucial marker of drug resistance, is a hub gene in gastric cancer drug sensitivity-predictive gene networks. Strong interaction between AKR1C1 and AKR1C3 was present in the network, which implies that AKR family genes are tumorigenic factors and diagnostic markers for gastric cancer. Further, our results suggest that these genes may be activated by molecular interactions rather than the activity of a single gene. There is abundant evidence that the AKR1C family, including AKR1C1 and AKR1C3, plays a key role in drug resistance of gastric cancer. These findings imply that our method provides biologically reliable results and that the constructed network for gastric cancer drugs may have crucial information to uncover mechanism-related therapeutic resistance and chemotherapy effectiveness. Importantly, these factors cannot be identified without considering molecular interactions.
To identify gastric cancer drug-sensitive and -resistant molecular interactions, we estimated drug-sensitive and -resistant networks based on identified gastric cancer drug markers. We extracted data from 400 drug-sensitive and -resistant cell lines. This data corresponded to the 400 largest (resistant) and 400 smallest (sensitive) drug sensitivity values of each drug. We then extracted the overlapping drug-sensitive and -resistant) cell lines for the four drugs (95 drug sensitive and 95 drug resistant cell lines were extracted) and estimated gene networks based on these cell lines. For each 95 drug sensitive and 95 drug resistant cell lines, we estimate drug sensitive and resistant gene networks (i.e., NW st and NW rs ) consisting of the 10% of the genes (976 genes) having the highest variance by using the lasso, i.e., the regulatory system between target gene y j and regulator genes x l is estimated as β jl by (2) with 1 = 0 . The gene networks are described by  Edge thickness represents the strength of effect of regulator on target genes (i.e., |β| ) and color indicates sign of the effect (red: "-" and blue: "+"). Node size represents degree of connectivity (i.e., hubness) of each gene in the networks AKR1B10 is a characteristic of drug-resistant gastric cancer cells. Indeed, the hubness of AKR1B10 was significantly smaller in drug-resistant cells compared to sensitive cell lines. The hubness of AKR1C3 also became smaller in drug-sensitive cell lines compared with networks estimated from drug-resistant cell lines. The activity of AKR family genes in drug resistant cell lines was strongly supported by previous studies [19-24, 39, 40].
In contrast, the hubnesses of ANXA10 and ZNF165 are characteristics of drug-sensitive gastric cancer cells. ANXA10 and AKR1C1 showed strong interactions in drug-sensitive cell lines, while their interaction disappeared in drug-resistant cell lines. The hubness of ZNF165 becomes weaker from the drug sensitive to resistant cell lines. Thus, high activities of ANXA10 and ZNF165 can be considered as signatures of drug-sensitive cell lines. It has been identified that ANXA10 is a crucial marker of gastric cancer and its related mechanism for drug sensitivity has been uncovered as follows. ANXA10, a novel gastric marker, shows extensive tissue and subclonal heterogeneity of dual stomachintestinal cell states [41]. Additionally, ANXA10 is significantly upregulated in gastric carcinoma and downregulated in gastric carcinogenesis [42]. Overexpression of ANXA10 in MKN-1 human gastric adenosquamous carcinoma cells leads to cell growth and increased apoptotic cells. These results suggest that ANXA10 plays a crucial role as a tumor suppressor in gastric cancer cells by restraining cell growth and inducing basal apoptosis. In around half of gastric cancer cases, ANXA10 is detected. Loss of ANXA10 is significantly correlated with disease progression and poor clinical outcomes in gastric cancer [43]. Repressed cell growth was observed in ANXA10-knockdown human gastric organoids. Hierarchical clustering showed that KLK6 and ANXA10 are enriched in cancer organoids that showed higher sensitivity to erlotinib [44]. Overexpression of ANXA10 in human epithelial cancer cells increases sensitivity to doxorubicin-induced apoptosis and reduces clonogenic ability [45]. ZNF165 is a novel cancer antigen capable of eliciting humoral immune responses and is involved in tumour biology [46]. Further, ZNF165 is expressed in gastric cancer, colon cancer, and non-small-cell lung carcinoma. Previous studies strongly support our data-driven results that high ANXA10 activity is a characteristic of drug-sensitive gastric cancer cells. Drug-sensitive and/or -resistancespecific molecular interactions may be crucial clues for uncovering drug resistance/senstivity mechanisms. We show regulatory effects for drug-sensitive and -resistance-specific markers, where the regulatory effect is computed by the combined regulator expression level and the effect of the regulator on its target gene. For the drug sensitive and resistant cell lines, the regulatory effect of l th regulator gene on j th target gene is described by  Fig. 2). Thus, gene activity can be described by the regulatory effect. Figure 3 shows the regulatory effect of the identified gastric cancer drugsensitive and -resistant markers on their targets (row: Targets) and the regulatory effect of their regulators on the identified markers (row: Regulators) in drug-sensitive and -resistant cell lines. As shown in Fig. 3, the identified drug sensitive markers ANXA10 and ZNF165 showed high activity in drug-sensitive cell lines, especially the effects of their regulators on ANXA10 and ZNF165 are disappeared in drug resistant cell lines (i.e., β rs = 0 ). Especially, their regulators show a large regulatory effect on ANXA10 and ZNF165 in drug-sensitive cell lines. These effects become smaller in the drug-resistant cells. The drug resistance markers also showed clearly different activities between drugsensitive and -resistant cell lines. The genes regulating AKR1B10 act only in drug  Fig. 2. "Targets" indicates regulatory effect of the identified markers on their target genes and "Regulators" indicates regulatory effect of the genes on the identified markers in drug sensitive and resistant networks (i.e., NW st and NW rs ) resistant cell-lines, and interactions between regulators and AKR1B10 disappeared in drug-sensitive cells.
We then identify molecular interactions that show significantly different regulatory system between drug-sensitive and -resistant cell lines. We randomly selected 95 cell lines and estimated networks (permuted drug-sensitive networks) similar to the networks in the bottom of Fig. 2. We also estimated permuted drug-resistant networks based on 95 randomly-selected cell lines. The differences between the two networks were computed for 1000 randomly-selected permutation samples ( DF (pm), pm = 1, ..., 1000 ). We computed the following permutation p value based on the difference between the two networks given in Fig. 2 ( DF true ) and DF(pm), where I(·) is the indicator function. The p value pm indicates the proportion that the absolute difference of edges in drug sensitive and resistant cell lines ( DF true ) is smaller than the absolute differences computed from the 1000 permuted samples ( DF pm ). The p value pm indicates that two genes show extremely different regulatory system between drug sensitive and resistant cell lines. Figure 4 shows the p value perm of permutation test to test difference of the regulatory system of the identified gastric cancer drug markers, where the markers are considered as not only regulator (rows) but also target (columns) genes. The identified drug resistance marker AKR1B10 showed significantly different regulatory effects on AKR1C3. Furthermore, another drug resistance marker, ZNF165, had several significantly different molecular interactions for its target genes (i.e., GPX3, MAL, PEG10, and SLITRK6). Further, the drug sensitivity markers AKR1C3 and ANXA10 (18) p value pm = 1000 pm=1 I(|DF true | < |DF (pm)|)) 1000 had significantly different regulatory systems with their regulators (AKR1C3: GPC4 and PEG10, ANXA10: AKR1C1). In addition to ANXA10, ZNF165, AKR1C1 and AKR1B10, we identified several more genes as common predictive markers for the four gastric cancer drugs (i.e., genes in the networks) that showed significant differences in the networks of drug sensitive and resistant cell lines (p value < 0.05). These results imply that identified gastric cancer drug markers have significantly different regulatory systems in the gene networks between the drug-sensitive and -resistant cell lines. From these results, we suggest that the high activity of AKR family genes may be involved in acquired drug resistance. Thus, controlling suppressors of AKR family genes may enhance the sensitivity to gastric cancer drugs (Additional file 1: Table S1). Our results also suggest that loss of ANXA10 activity may lead to drug resistance. Although little evidence for the role of ZNF165 were found for mechanisms related to gastric cancer or anti-cancer drugs, it can be suggested that ZNF165 is a novel marker of gastric cancer drug responses. The controlling inducers of ANXA10 and ZNF165 may lead to enhanced drug sensitivity (Additional file 1: Table S1).

Discussion
We introduced a novel computational strategy for response variable predictive gene network estimation. To identify biological mechanism-specific gene networks, we propose a model that consists of gene network estimation and prediction of a specific biological process. Furthermore, we incorporated network biology into the prediction model, which enabled the PredictiveNetwork to simultaneously perform gene network estimation and prediction. Our method estimates gene networks that achieves minimized prediction and network estimation errors. Thus, we can identify response predictionspecific characteristics of gene networks. Additionally, our method can construct prediction models based on crucial subnetworks involved in specific biological processes. These lead to effective interpretation of prediction results and biologically-reliable predictive marker identification.
Graph Attention Networks (GAN) [47] is a strategy that also performs network estimation and prediction, simultaneously. GAN is a neural network approach that leverages masked self-attentional layers based on similarity of node in neighborhoods. Thus, gene regulatory system can be described by clinical characteristic-specific self-attention network. On the other hand, the gene network estimation procedure of the Predic-tiveNetwork can be considered as clinical characteristic specific graphical gaussian modeling and the estimated gene regulatory network is given as weighted-directed adjacency matrix.
To illustrate the proposed strategy, we performed Monte Carlo simulations. The simulation results showed that the proposed strategy has outstanding performance for feature selection in gene network estimation and prediction. Our results also demonstrate excellent prediction accuracy. We applied the proposed PredictiveNetwork to estimate gene networks that are responsive to gastric cancer drugs. Using the GDSC dataset, we estimated doxorubicin, Mitomycin-C, 5-FU, and Docetaxel-responsive gene networks. The identified gastric drug response markers showed significantly different regulatory